Trying Out SageMaker HyperPod's Automatic Node Recovery
Hello! This is Takakuni (@takakuni_) from the Consulting Department, Cloud Business Division.
The other day, Amazon EKS added support for automatic node repair. A blog post about it is already out, which is great to see.
In fact, the EKS-orchestrated version of SageMaker HyperPod has had a similar feature (automatic node recovery) since before that update shipped.
So this time, I'd like to try out SageMaker HyperPod's automatic node recovery.
Automatic Node Recovery
As the name suggests, automatic node recovery automatically reboots or replaces a node (instance) in a SageMaker HyperPod cluster when a failure occurs.
As far as I can tell from the documentation, only the EKS orchestrator is supported.
Health monitoring for SageMaker HyperPod automatic node recovery consists of the following three checks:
- Monitoring by the SageMaker HyperPod health-monitoring agent
- Basic health checks
- Deep health checks
While EKS's automatic node repair uses the eks-node-monitoring-agent add-on, SageMaker HyperPod uses the SageMaker HyperPod health-monitoring agent.
Note that the monitored items differ considerably between the two.
Also note that nodes using CPU instances are not supported:
Node auto-replacement is not supported for CPU instances.
Configuration
Automatic node recovery is configured per cluster, either at cluster creation or by updating an existing cluster.
In general, I recommend keeping it enabled. Incidentally, even when set to None, the agent still appears to label the nodes when it detects a fault.
Automatic node recovery runs when issues are found from health-monitoring agent, basic health checks, and deep health checks. If set to None, the health monitoring agent will label the instances when a fault is detected, but it will not automatically initiate any repair or recovery actions on the affected nodes. This option is not recommended.
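For an existing cluster, the same field can in principle be flipped afterwards with update-cluster. A minimal sketch; that UpdateCluster accepts NodeRecovery alongside the instance groups in exactly this input shape is my assumption, and all names and ARNs are placeholders:

```shell
# Hedged sketch: enabling automatic node recovery on an existing cluster.
# The input shape mirrors the create-cluster config used in this post;
# names and ARNs below are placeholders, not real resources.
cat > update-config.json << 'EOL'
{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "worker-group",
      "InstanceType": "ml.g5.xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://hyperpod-eks-bucket-123456789012-us-west-2",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/hyperpod-eks-ExecutionRole-us-west-2"
    }
  ],
  "NodeRecovery": "Automatic"
}
EOL
# Validate the JSON locally before sending it to the API:
python3 -m json.tool update-config.json > /dev/null && echo "config OK"
# aws sagemaker update-cluster --cli-input-json file://update-config.json
```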
Trying It Out
Now let's create a SageMaker HyperPod cluster.
Automatic node recovery is enabled or disabled with the NodeRecovery field.
Because we will later run a command that triggers a fault for the SageMaker HyperPod health-monitoring agent, we use a GPU instance (ml.g5.xlarge).[1]
{
  "ClusterName": "ml-cluster",
  "Orchestrator": {
    "Eks": {
      "ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/hyperpod-eks-cluster"
    }
  },
  "InstanceGroups": [
    {
      "InstanceGroupName": "worker-group",
      "InstanceType": "ml.g5.xlarge",
      "InstanceCount": 1,
      "InstanceStorageConfigs": [
        {
          "EbsVolumeConfig": {
            "VolumeSizeInGB": 500
          }
        }
      ],
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://hyperpod-eks-bucket-123456789012-us-west-2",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/hyperpod-eks-ExecutionRole-us-west-2",
      "ThreadsPerCore": 2
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["sg-0a0ab81a7745a00da"],
    "Subnets": ["subnet-091a8cf419deb16e8"]
  },
  "NodeRecovery": "Automatic"
}
The HyperPod cluster ARN is returned, so it looks like the cluster was created successfully.
[cloudshell-user@ip-10-144-112-82 ~]$ cat > cluster-config.json << EOL
> {
> "ClusterName": "ml-cluster",
> "Orchestrator": {
> "Eks":
> {
> "ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/hyperpod-eks-cluster"
> }
> },
> "InstanceGroups": [
> {
> "InstanceGroupName": "worker-group",
> "InstanceType": "ml.g5.xlarge",
> "InstanceCount": 1,
> "InstanceStorageConfigs": [
> {
> "EbsVolumeConfig": {
> "VolumeSizeInGB": 500
> }
> }
> ],
> "LifeCycleConfig": {
> "SourceS3Uri": "s3://hyperpod-eks-bucket-123456789012-us-west-2",
> "OnCreate": "on_create.sh"
> },
> "ExecutionRole": "arn:aws:iam::123456789012:role/hyperpod-eks-ExecutionRole-us-west-2",
> "ThreadsPerCore": 2
> }
> ],
> "VpcConfig": {
> "SecurityGroupIds": ["sg-0a0ab81a7745a00da"],
> "Subnets":["subnet-091a8cf419deb16e8"]
> },
> "NodeRecovery": "Automatic"
> }
> EOL
[cloudshell-user@ip-10-144-112-82 ~]$
[cloudshell-user@ip-10-144-112-82 ~]$ aws sagemaker create-cluster \
> --cli-input-json file://cluster-config.json \
> --region $AWS_REGION
{
"ClusterArn": "arn:aws:sagemaker:us-west-2:123456789012:cluster/499uv59u9tzq"
}
After about five to six minutes, the cluster and its instance reached InService and Running.
Let's check the node state with kubectl.
The sagemaker.amazonaws.com/node-health-status label is set to Schedulable.
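When you only care about this one label, kubectl can print it as an extra column. The offline variant below runs against a stub of the node JSON so the extraction can be tried without a cluster; the file name and stub contents are illustrative:

```shell
# With a live cluster (assumes kubeconfig already points at it):
#   kubectl get nodes -L sagemaker.amazonaws.com/node-health-status
# Offline: pull the label out of node JSON, using a stub object here.
cat > nodes.json << 'EOL'
{"items":[{"metadata":{"name":"hyperpod-i-04a55aa04ba2578a3","labels":{"sagemaker.amazonaws.com/node-health-status":"Schedulable"}}}]}
EOL
python3 -c 'import json
for n in json.load(open("nodes.json"))["items"]:
    print(n["metadata"]["name"], n["metadata"]["labels"]["sagemaker.amazonaws.com/node-health-status"])'
# → hyperpod-i-04a55aa04ba2578a3 Schedulable
```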
[cloudshell-user@ip-10-144-112-82 ~]$ aws eks update-kubeconfig --name hyperpod-eks-cluster
Updated context arn:aws:eks:us-west-2:123456789012:cluster/hyperpod-eks-cluster in /home/cloudshell-user/.kube/config
[cloudshell-user@ip-10-144-112-82 ~]$ kubectl describe node
Name: hyperpod-i-04a55aa04ba2578a3
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=ml.g5.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2b
kubernetes.io/arch=amd64
kubernetes.io/hostname=hyperpod-i-04a55aa04ba2578a3
kubernetes.io/os=linux
node.kubernetes.io/instance-type=ml.g5.xlarge
sagemaker.amazonaws.com/cluster-name=ml-cluster
sagemaker.amazonaws.com/compute-type=hyperpod
sagemaker.amazonaws.com/instance-group-name=worker-group
sagemaker.amazonaws.com/node-health-status=Schedulable
topology.k8s.aws/zone-id=usw2-az2
topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2b
Annotations: alpha.kubernetes.io/provided-node-ip: 10.1.24.174
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 02 Jan 2025 15:02:59 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: hyperpod-i-04a55aa04ba2578a3
AcquireTime: <unset>
RenewTime: Thu, 02 Jan 2025 15:08:25 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Thu, 02 Jan 2025 15:04:30 +0000 Thu, 02 Jan 2025 15:02:51 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 02 Jan 2025 15:04:30 +0000 Thu, 02 Jan 2025 15:02:51 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 02 Jan 2025 15:04:30 +0000 Thu, 02 Jan 2025 15:02:51 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 02 Jan 2025 15:04:30 +0000 Thu, 02 Jan 2025 15:02:59 +0000 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.1.24.174
InternalDNS: ip-10-1-24-174.us-west-2.compute.internal
Hostname: ip-10-1-24-174.us-west-2.compute.internal
Capacity:
cpu: 4
ephemeral-storage: 104845292Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16181656Ki
nvidia.com/gpu: 1
pods: 14
Allocatable:
cpu: 3920m
ephemeral-storage: 95551679124
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15164824Ki
nvidia.com/gpu: 1
pods: 14
System Info:
Machine ID: ec2c8ce0e90f25e6683c10d7ca6c280b
System UUID: ec2ec4a8-4771-ff71-0ae5-32b06cf87d91
Boot ID: 72cc8921-3156-4cd4-bcab-35c1584ea1ad
Kernel Version: 5.10.228-219.884.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.23
Kubelet Version: v1.30.6-eks-94953ac
Kube-Proxy Version: v1.30.6-eks-94953ac
ProviderID: aws:///usw2-az2/sagemaker/cluster/hyperpod-499uv59u9tzq-i-04a55aa04ba2578a3
Non-terminated Pods: (10 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
aws-hyperpod health-monitoring-agent-wrs5r 500m (12%) 500m (12%) 512Mi (3%) 512Mi (3%) 5m29s
default hyperpod-dependencies-hyperpod-helm-chart-6f8989f9bb-bhg7p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21m
default hyperpod-dependencies-mpi-operator-574c8c7f-544g6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21m
kube-system aws-node-ggknb 50m (1%) 0 (0%) 0 (0%) 0 (0%) 5m29s
kube-system coredns-787cb67946-d4wdh 100m (2%) 0 (0%) 70Mi (0%) 170Mi (1%) 21m
kube-system coredns-787cb67946-v6996 100m (2%) 0 (0%) 70Mi (0%) 170Mi (1%) 21m
kube-system eks-pod-identity-agent-9ncmc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5m29s
kube-system hyperpod-dependencies-nvidia-device-plugin-9fj8p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5m29s
kube-system kube-proxy-dsc9r 100m (2%) 0 (0%) 0 (0%) 0 (0%) 5m29s
kubeflow hyperpod-dependencies-training-operators-65dd9bb984-jq9km 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 850m (21%) 500m (12%)
memory 652Mi (4%) 852Mi (5%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 5m19s kube-proxy
Normal Starting 5m37s kubelet Starting kubelet.
Warning InvalidDiskCapacity 5m37s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 5m37s (x2 over 5m37s) kubelet Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 5m37s (x2 over 5m37s) kubelet Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 5m37s (x2 over 5m37s) kubelet Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 5m37s kubelet Updated Node Allocatable limit across pods
Normal Synced 5m29s cloud-node-controller Node synced successfully
Normal NodeReady 5m29s kubelet Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeReady
Normal RegisteredNode 5m25s node-controller Node hyperpod-i-04a55aa04ba2578a3 event: Registered Node hyperpod-i-04a55aa04ba2578a3 in Controller
[cloudshell-user@ip-10-144-112-82 ~]$
Let's log in to the worker node.
curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
chmod +x easy-ssh.sh
./easy-ssh.sh -c worker-group ml-cluster
We're logged in.
[cloudshell-user@ip-10-144-112-82 ~]$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 5594 100 5594 0 0 24909 0 --:--:-- --:--:-- --:--:-- 24973
[cloudshell-user@ip-10-144-112-82 ~]$ chmod +x easy-ssh.sh
[cloudshell-user@ip-10-144-112-82 ~]$ ./easy-ssh.sh -c worker-group ml-cluster
=================================================
==== 🚀 HyperPod Cluster Easy SSH Script! 🚀 ====
=================================================
Cluster id: 499uv59u9tzq
Instance id: i-04a55aa04ba2578a3
Node Group: worker-group
grep: /home/cloudshell-user/.ssh/config: No such file or directory
Would you like to add ml-cluster to ~/.ssh/config (yes/no)?
> yes
✅ adding ml-cluster to ~/.ssh/config:
cat: /home/cloudshell-user/.ssh/id_rsa.pub: No such file or directory
1. Detected SSH public key ~/.ssh/id_rsa.pub on the cluster. Skipping adding...
Now you can run:
$ ssh ml-cluster
Starting session with SessionId: takakuni-c2neynoz5ocvykt43nt4n4d9pa
sh-4.2#
Let's look up its IP address.
TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-ipv4
It's a bit hard to read, but this is the node running at 10.1.24.174.
sh-4.2# TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 56 100 56 0 0 14285 0 --:--:-- --:--:-- --:--:-- 18666
sh-4.2# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-ipv4
10.1.24.174sh-4.2#
Triggering a Fault
Now that we are logged in to the node, let's test a fault with the following command.
Replace the timestamp and IP address with your node's actual values.
sudo sh -c "echo 'Jan 2 15:14:10 ip-10-1-24-174 kernel: [ 378.703529] NVRM: Xid (PCI:0000:b9:00): 74, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)' >> /var/log/messages"
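Rather than typing the timestamp by hand, the line can be assembled from the node's own clock. A sketch assuming standard syslog timestamp formatting; the Xid/NVLink payload is copied verbatim from the command above:

```shell
# Build the fake Xid log line from the node's current time and hostname
# so the entry looks fresh (syslog-style "%b %e %H:%M:%S" timestamp).
TS=$(date '+%b %e %H:%M:%S')
HOST=$(hostname)
LINE="$TS $HOST kernel: [  378.703529] NVRM: Xid (PCI:0000:b9:00): 74, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)"
printf '%s\n' "$LINE" | tee xid-line.txt
# On the node, append it to the log the agent watches:
# sudo sh -c "echo '$LINE' >> /var/log/messages"
```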
The instance transitioned to Pending within just a few seconds. Impressive.
Let's check the node details while it is still Pending.
aws eks update-kubeconfig --name hyperpod-eks-cluster
kubectl describe node
Labels, Annotations, and Taints have been updated, and new entries have been added.
[cloudshell-user@ip-10-132-44-216 ~]$ aws eks update-kubeconfig --name hyperpod-eks-cluster
Updated context arn:aws:eks:us-west-2:123456789012:cluster/hyperpod-eks-cluster in /home/cloudshell-user/.kube/config
[cloudshell-user@ip-10-132-44-216 ~]$ kubectl describe node
Name: hyperpod-i-04a55aa04ba2578a3
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=ml.g5.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2b
kubernetes.io/arch=amd64
kubernetes.io/hostname=hyperpod-i-04a55aa04ba2578a3
kubernetes.io/os=linux
node.kubernetes.io/instance-type=ml.g5.xlarge
sagemaker.amazonaws.com/cluster-name=ml-cluster
sagemaker.amazonaws.com/compute-type=hyperpod
sagemaker.amazonaws.com/fault-reasons=NvidiaGpuXidError
sagemaker.amazonaws.com/fault-types=NvidiaErrorTerminate
sagemaker.amazonaws.com/instance-group-name=worker-group
sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement
topology.k8s.aws/zone-id=usw2-az2
topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2b
Annotations: alpha.kubernetes.io/provided-node-ip: 10.1.24.174
node.alpha.kubernetes.io/ttl: 0
sagemaker.amazonaws.com/fault-details:
{"faults":[{"timestamp":"2025-01-02T15:14:10Z","type":"NvidiaErrorTerminate","reason":"NvidiaGpuXidError","message":"[ 378.703529] NVRM: X...
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 02 Jan 2025 15:02:59 +0000
Taints: node.cloudprovider.kubernetes.io/shutdown:NoSchedule
node.kubernetes.io/unreachable:NoSchedule
sagemaker.amazonaws.com/node-health-status=Unschedulable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: hyperpod-i-04a55aa04ba2578a3
AcquireTime: <unset>
RenewTime: Thu, 02 Jan 2025 15:14:13 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Thu, 02 Jan 2025 15:09:36 +0000 Thu, 02 Jan 2025 15:14:54 +0000 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Thu, 02 Jan 2025 15:09:36 +0000 Thu, 02 Jan 2025 15:14:54 +0000 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Thu, 02 Jan 2025 15:09:36 +0000 Thu, 02 Jan 2025 15:14:54 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Thu, 02 Jan 2025 15:09:36 +0000 Thu, 02 Jan 2025 15:14:54 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.1.24.174
InternalDNS: ip-10-1-24-174.us-west-2.compute.internal
Hostname: ip-10-1-24-174.us-west-2.compute.internal
Capacity:
cpu: 4
ephemeral-storage: 104845292Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16181656Ki
nvidia.com/gpu: 1
pods: 14
Allocatable:
cpu: 3920m
ephemeral-storage: 95551679124
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15164824Ki
nvidia.com/gpu: 1
pods: 14
System Info:
Machine ID: ec2c8ce0e90f25e6683c10d7ca6c280b
System UUID: ec2ec4a8-4771-ff71-0ae5-32b06cf87d91
Boot ID: 72cc8921-3156-4cd4-bcab-35c1584ea1ad
Kernel Version: 5.10.228-219.884.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.23
Kubelet Version: v1.30.6-eks-94953ac
Kube-Proxy Version: v1.30.6-eks-94953ac
ProviderID: aws:///usw2-az2/sagemaker/cluster/hyperpod-499uv59u9tzq-i-04a55aa04ba2578a3
Non-terminated Pods: (10 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
aws-hyperpod health-monitoring-agent-wrs5r 500m (12%) 500m (12%) 512Mi (3%) 512Mi (3%) 12m
default hyperpod-dependencies-hyperpod-helm-chart-6f8989f9bb-bhg7p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
default hyperpod-dependencies-mpi-operator-574c8c7f-544g6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
kube-system aws-node-ggknb 50m (1%) 0 (0%) 0 (0%) 0 (0%) 12m
kube-system coredns-787cb67946-d4wdh 100m (2%) 0 (0%) 70Mi (0%) 170Mi (1%) 28m
kube-system coredns-787cb67946-v6996 100m (2%) 0 (0%) 70Mi (0%) 170Mi (1%) 28m
kube-system eks-pod-identity-agent-9ncmc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12m
kube-system hyperpod-dependencies-nvidia-device-plugin-9fj8p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12m
kube-system kube-proxy-dsc9r 100m (2%) 0 (0%) 0 (0%) 0 (0%) 12m
kubeflow hyperpod-dependencies-training-operators-65dd9bb984-jq9km 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 850m (21%) 500m (12%)
memory 652Mi (4%) 852Mi (5%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 12m kube-proxy
Normal Starting 12m kubelet Starting kubelet.
Warning InvalidDiskCapacity 12m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 12m (x2 over 12m) kubelet Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 12m (x2 over 12m) kubelet Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 12m (x2 over 12m) kubelet Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 12m kubelet Updated Node Allocatable limit across pods
Normal Synced 12m cloud-node-controller Node synced successfully
Normal NodeReady 12m kubelet Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeReady
Normal RegisteredNode 12m node-controller Node hyperpod-i-04a55aa04ba2578a3 event: Registered Node hyperpod-i-04a55aa04ba2578a3 in Controller
Normal NodeNotReady 42s node-controller Node hyperpod-i-04a55aa04ba2578a3 status is now: NodeNotReady
[cloudshell-user@ip-10-132-44-216 ~]$
Once the Pods had been evicted from the node, a new node started up.
The instance ID is i-0e606d201c423562c, so we can see that a different instance has been launched.
It also became Running after a few minutes. The replacement is complete.
Let's also check the SageMaker HyperPod health-monitoring agent logs in CloudWatch Logs.
About a second after I ran the command, the health-monitoring agent detected the fault and started replacing the node.
That is impressively fast detection.
2025-01-02T14:13:18.029Z {"level":"info","ts":"2025-01-02T14:13:18Z","msg":"NPD caught ","condition type: ":"KernelDeadlock","with condition details ":{"type":"KernelDeadlock","status":"False","transition":"2025-01-02T14:13:18.004033678Z","reason":"KernelHasNoDeadlock","message":"kernel has no deadlock"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:13:18.029Z {"level":"info","ts":"2025-01-02T14:13:18Z","msg":"NPD caught ","condition type: ":"NvidiaErrorTerminate","with condition details ":{"type":"NvidiaErrorTerminate","status":"False","transition":"2025-01-02T14:13:18.004033778Z","reason":"NvidiaNoErrorRequiredTerminate","message":"Nvidia no error required terminate"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:13:18.029Z {"level":"info","ts":"2025-01-02T14:13:18Z","msg":"NPD caught ","condition type: ":"NvidiaErrorReboot","with condition details ":{"type":"NvidiaErrorReboot","status":"False","transition":"2025-01-02T14:13:18.004033858Z","reason":"NvidiaNoErrorRequiredReboot","message":"Nvidia GPU reboot not required"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:13:18.029Z {"level":"info","ts":"2025-01-02T14:13:18Z","msg":"NPD caught ","condition type: ":"NeuronErrorTerminate","with condition details ":{"type":"NeuronErrorTerminate","status":"False","transition":"2025-01-02T14:13:18.004033918Z","reason":"NeuronNoErrorRequiredTerminate","message":"Neuron no error required terminate"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:13:22.801Z {"level":"info","ts":"2025-01-02T14:13:18Z","msg":"NPD caught ","condition type: ":"NeuronErrorReboot","with condition details ":{"type":"NeuronErrorReboot","status":"False","transition":"2025-01-02T14:13:18.004033978Z","reason":"NeuronNoErrorRequiredReboot","message":"Neuron no error required reboot"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:23:11.137Z {"level":"info","ts":"2025-01-02T14:23:10Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2025-01-02T14:23:10Z","reason":"NvidiaGpuXidError","message":"Node condition NvidiaErrorTerminate is now: True, reason: NvidiaGpuXidError, message: \"[ 378.703529] NVRM: Xid (PCI:0000:b9:00): 74, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:23:11.137Z {"level":"info","ts":"2025-01-02T14:23:10Z","msg":"NPD caught ","condition type: ":"KernelDeadlock","with condition details ":{"type":"KernelDeadlock","status":"False","transition":"2025-01-02T14:13:18.004033678Z","reason":"KernelHasNoDeadlock","message":"kernel has no deadlock"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:23:11.137Z {"level":"info","ts":"2025-01-02T14:23:10Z","msg":"NPD caught ","condition type: ":"NvidiaErrorTerminate","with condition details ":{"type":"NvidiaErrorTerminate","status":"True","transition":"2025-01-02T14:23:10Z","reason":"NvidiaGpuXidError","message":"[ 378.703529] NVRM: Xid (PCI:0000:b9:00): 74, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:23:11.137Z {"level":"info","ts":"2025-01-02T14:23:10Z","msg":"NPD caught ","condition type: ":"NvidiaErrorReboot","with condition details ":{"type":"NvidiaErrorReboot","status":"False","transition":"2025-01-02T14:13:18.004033858Z","reason":"NvidiaNoErrorRequiredReboot","message":"Nvidia GPU reboot not required"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:23:11.137Z {"level":"info","ts":"2025-01-02T14:23:10Z","msg":"NPD caught ","condition type: ":"NeuronErrorTerminate","with condition details ":{"type":"NeuronErrorTerminate","status":"False","transition":"2025-01-02T14:13:18.004033918Z","reason":"NeuronNoErrorRequiredTerminate","message":"Neuron no error required terminate"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2025-01-02T14:23:15.801Z {"level":"info","ts":"2025-01-02T14:23:10Z","msg":"NPD caught ","condition type: ":"NeuronErrorReboot","with condition details ":{"type":"NeuronErrorReboot","status":"False","transition":"2025-01-02T14:13:18.004033978Z","reason":"NeuronNoErrorRequiredReboot","message":"Neuron no error required reboot"},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
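These agent logs are verbose: most entries report conditions that did not fire. With stub lines mirroring the JSON payloads above (the file name and stub contents are illustrative), a grep narrows them down to the conditions that actually flipped:

```shell
# Stub of two agent log payloads; real lines carry a CloudWatch timestamp
# prefix. Only the second entry reports a condition that fired.
cat > hma.log << 'EOL'
{"msg":"NPD caught ","condition type: ":"NvidiaErrorTerminate","with condition details ":{"status":"False","reason":"NvidiaNoErrorRequiredTerminate"}}
{"msg":"NPD caught ","condition type: ":"NvidiaErrorTerminate","with condition details ":{"status":"True","reason":"NvidiaGpuXidError"}}
EOL
# Keep only conditions whose status flipped to True:
grep '"status":"True"' hma.log
```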
Summary
That wraps up my look at SageMaker HyperPod's automatic node recovery.
Automatically replacing a node as soon as a fault is detected is very reassuring. I hope this post is useful to someone.
This was Takakuni (@takakuni_) from the Consulting Department, Cloud Business Division!
Looking at the health-monitoring agent's Helm chart, its nodeAffinity is restricted to specific instance types, which is why a GPU instance is required. https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/templates/health-monitoring-agent.yaml#L66-L91 ↩︎